Smoking is one of the leading causes of deaths globally, contributing significantly to a wide range of health problems, including cardiovascular diseases, respiratory conditions, and various cancers. Over the years, many countries have implemented measures to reduce smoking prevalence, such as public health campaigns, tobacco taxes, smoking bans and regulations on tobacco advertising. Despite these efforts, smoking continues to affect people worldwide, with its prevalence varying significantly across regions and populations. By investigating the prevalence of smoking among adults globally from 2000 to 2020, this study aims to explore and identify patterns in smoking behaviours.
How does the prevalence of smoking in adults across the world vary and what trends emerge from this analysis?
The project is organised into key sections to help you navigate its contents. The Codebook folder provides detailed documentation on the dataset, including variable descriptions and structure. The Data Folder Contains the raw datasets. The Plots Folder contains Visualisations created during analysis. The Scripts Folder contains the code used for data processing, analysis, and visualisation.
The raw dataset for this visualisation project comes from Multiple sources compiled by World Bank (2024) – processed by Our World In Data. “Prevalence of current tobacco use (% of adults)”
This data represents the percentage of the population ages 15 years and over who currently use any tobacco product (smoked and/or smokeless tobacco) on a daily or non-daily basis. Tobacco products include cigarettes, pipes, cigars, cigarillos, waterpipes, bidis, kretek, heated tobacco products, and all forms of smokeless (oral and nasal) tobacco. Tobacco products do not e-cigarettes (which do not contain tobacco), “e-cigars”, “e-hookahs”, JUUL and “e-pipes”. The rates are age-standardized to the WHO Standard Population.
Countries with irregular surveys or significant gaps in data contribute to large uncertainties in the estimates, making these results less reliable. Furthermore, the dataset excludes information on e-cigarettes and other non-tobacco products, focusing solely on traditional tobacco use. These limitations mean that the findings should be interpreted with caution, particularly for countries with incomplete or inconsistent data, as the trends and prevalence rates presented may not fully capture the complexity of smoking behaviours globally.
# Packages to install and load
packages <- c("tidyverse", "ggplot2", "tidyr", "dplyr", "plotly", "rnaturalearth", "rnaturalearthdata", "sf", "htmlwidgets")
# Function to install packages and load them
install_and_load <- function(packages) {
for (package in packages) {
if (!require(package, character.only = TRUE)) {
install.packages(package, dependencies = TRUE)
library(package, character.only = TRUE)
} else {
library(package, character.only = TRUE)
}
}
}
# Run function
install_and_load(packages)
# Load raw data / replace "data/smoking.csv" with the path to CVS file.
rawdata <- read.csv("data/smoking.csv")
#sanity check
str(rawdata) # structure and summary
summary(rawdata)
head(rawdata) # Check first rows
dim(rawdata) # Check dimensions
colSums(is.na(rawdata))# Check missing values
# Check for duplicates
duplicates <- rawdata[duplicated(rawdata), ]
print(duplicates)
# Unique values in key columns
unique(rawdata$Entity)
unique(rawdata$Year)
The dataset initially included various entities, such as global regions and income levels. To focus on country-level analysis, these entities were excluded. Additionally, the data was streamlined to 5-year intervals for clarity, which involved removing the years 2018 and 2019. For simplicity and readability, the variable representing prevalence was also renamed.
# Cleaning data
# Remove specific entities
countries_data <- rawdata[!rawdata$Entity %in% c("East Asia and Pacific (WB)", "Sub-Saharan Africa (WB)",
"Upper-middle-income countries", "Europe and Central Asia (WB)",
"World", "European Union (27)", "Low-income countries",
"Lower-middle-income countries", "Middle East and North Africa (WB)",
"Middle-income countries", "North America (WB)", "South Asia (WB)",
"Latin America and Caribbean (WB)", "High-income countries"), ]
# Exclude specific years
countries_data <- countries_data %>%
filter(!(Year %in% c(2018, 2019)))
# Rename column for ease
countries_data <- countries_data %>%
rename(Prevalence = Prevalence.of.current.tobacco.use....of.adults.)
In this section, I loaded the world map data using the ne_countries() function, which provides a geospatial dataset containing geographic and country-level information. To ensure compatibility with the smoking data, discrepancies in ISO country codes between the world map dataset (world) and the smoking prevalence dataset (countries_data) were identified. Specifically, any codes present in countries_data but missing in world, and vice versa. I created the function fix_iso_codes() to correct the mismatched iso codes.
# Load world map data
world <- ne_countries(scale = "medium", returnclass = "sf")
# Identify codes in countries_data but not in world
missing_in_world <- setdiff(countries_data$Code, world$iso_a3)
print(missing_in_world)
# Identify codes in world but not in countries_data
missing_in_data <- setdiff(world$iso_a3, countries_data$Code)
print(missing_in_data)
# Fix missing ISO codes
fix_iso_codes <- function(world_data) {
world_data %>%
mutate(iso_a3 = ifelse(name == "France", "FRA", iso_a3)) %>%
mutate(iso_a3 = ifelse(name == "Norway", "NOR", iso_a3))
}
world <- fix_iso_codes(world)
To prepare the datasets for visualisation, I merged the world map data and the cleaned smoking prevalence data. I aligned the datasets by matching ISO codes (iso_a3 in the world dataset and Code in the smoking dataset). During this provess, all the missing prevalence values were replaced by 0, ensuring all the countries had defined values for visualisation, even if the data was not available. To make the data compatible with the interactive visualisation in Plotly, I had to remove the geometric information. A detailed inspection of the merged dataset was performed. Finally, distributions of data across years and countries were tabulated to confirm consistency for visualisation. This ensured a clean, structured dataset suitable for creating accurate and informative visualisations.
# Merge datasets, sanity check/clean merged data
# Merge datasets
map_data <- world %>%
left_join(countries_data, by = c("iso_a3" = "Code"))
# Replace NA prevalence values with 0
map_data <- map_data %>%
mutate(Prevalence = ifelse(is.na(Prevalence), 0, Prevalence))
# Remove duplicates
map_data <- map_data[!duplicated(map_data), ]
# Remove geometry for Plotly compatibility
plot_data <- map_data %>%
st_set_geometry(NULL)
# Structure and summary of cleaned dataset
str(plot_data)
summary(plot_data)
# Check for duplicates country-year pairs
duplicates <- plot_data %>%
group_by(iso_a3, Year) %>%
filter(n() > 1)
print(duplicates)
# Check for missing prevalence values
missing_prevalence <- plot_data %>%
filter(is.na(Prevalence))
print(missing_prevalence)
# Distribution of years and countries
table(plot_data$Year)
table(plot_data$iso_a3)
The cleaned dataset consists of country-level information on smoking prevalence among adults from 2000 to 2020, focusing on 5-year intervals. Key variables include Entity, which represents the name of each country; Code, the unique ISO-3 country code used for data alignment; Year, indicating the specific year of observation; and Prevalence, which provides the percentage of the adult population currently using tobacco products. The data has been cleaned to exclude irrelevant entities. Furthermore, missing values in prevalence have been accounted for, and duplicate entries removed, resulting in a comprehensive and structured dataset ideal for visualising smoking trends over time.
The initial visualisation provided a basic representation of smoking prevalence on a static map. While it successfully displayed the prevalence rates for countries, it lacked interactivity and failed to account for changes across years. This did not allow us to explore how smoking prevalence evolved over time or access specific data points for each country. The map presented a single snapshot without the ability to delve into detailed information, such as individual country names or exact prevalence percentages. These limitations highlighted the need for a more dynamic and informative visualisation.
# first visualisation
ggplot(data = map_data) +
geom_sf(aes(fill = Prevalence), color = "white", size = 0.2) +
scale_fill_gradient(low = "blue", high = "red", na.value = "grey") +
labs(title = "Global Smoking Prevalence Heatmap",
fill = "Prevalence (%)") +
theme_minimal()
My next visualisation aimed to be more interactive. This time, I managed to get individual interactive visualisations for each year. However, this did not allow us to visualise the evolution across time in a single plot.
# Function to create an interactive plot for a specific year
# Create a list of subsets by year
yearly_subcategories <- split(plot_data, plot_data$Year)
create_interactive_year_plot <- function(data, year) {
plot <- ggplot(data = data %>% filter(Year == year)) +
geom_sf(aes(fill = Prevalence, text = paste("Country:", Entity, "<br>Prevalence:", Prevalence, "%")),
size = 0.2) + # Removed black borders
scale_fill_gradient(low = "lightyellow", high = "darkred", na.value = "white", name = "Prevalence (%)") +
labs(
title = paste("Global Smoking Prevalence -", year),
fill = "Prevalence (%)"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, size = 16),
legend.position = "bottom",
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
)
# Convert ggplot to interactive plotly object with hover text
ggplotly(plot, tooltip = c("text"))
}
# Create interactive visualisations for each year
interactive_plot_2000 <- create_interactive_year_plot(map_data, 2000)
interactive_plot_2005 <- create_interactive_year_plot(map_data, 2005)
interactive_plot_2010 <- create_interactive_year_plot(map_data, 2010)
interactive_plot_2015 <- create_interactive_year_plot(map_data, 2015)
interactive_plot_2020 <- create_interactive_year_plot(map_data, 2020)
To create my final visualisation, I combined the previous data sets of the individual years into one single data set.
For the final visualisation, I needed to merge the individual datasets for each year back into a single dataset. Initially, I used the data set countries_data to create my visualisation, however it resulted in unexpected behaviour when combined with the plotly package. Seperating and combining the datasets allowed me to correct the visualisation. Additionally, I added a new colour scheme with adjusted thresholds to better differentiate smoking prevalence rates across countries. This approach improved clarity and enhanced the visual impact of the data. I also incorporated all years into a unified visualisation of all years, which allows us to observe trends over time seamlessly. Finally, I added descriptive titles and annotations to provide context and guide interpretation, making the visualisation more informative.
# Combine all yearly subcategories back into a single dataset
combined_data_year <- bind_rows(yearly_subcategories)
# Enhanced plot
final_plot <- plot_ly(
data = combined_data_year,
type = "choropleth",
locations = ~iso_a3,
locationmode = "ISO-3",
z = ~Prevalence,
frame = ~Year,
text = ~ifelse(
is.na(Prevalence),
NA,
paste("Country:", Entity, "<br>Prevalence:", Prevalence, "%") # Show prevalence for available data
),
hoverinfo = "text", # Show only custom hover text
colorscale = list(
list(0, "#ffeda0"), # Light yellow for low prevalence
list(0.2, "#feb24c"), # Light orange
list(0.4, "#fd8d3c"), # Orange
list(0.6, "#fc4e2a"), # Reddish-orange
list(0.8, "#e31a1c"), # Red
list(1, "#67000d") # Dark red for high prevalence
),
zmin = 0,
zmax = 70, # Adjusted prevalence
showscale = TRUE,
marker = list(line = list(color = "grey", width = 0.5)) # Border for countries
) %>%
layout(
title = "Global Smoking Prevalence in Adults (2000-2020)",
geo = list(
projection = list(type = "equirectangular"), # Rectangular projection
showcoastlines = TRUE,
coastlinecolor = "grey", # Coastline color
showcountries = TRUE, # country borders
countrycolor = "grey", # Country border color
showland = TRUE, # Highlight land
landcolor = "white", # Land color
showocean = TRUE, # Highlight oceans
oceancolor = "lightblue", # Ocean color
showframe = FALSE # Remove map frame
),
annotations = list(
# Note annotation
list(
x = 0.5, # Horizontal
y = -0.1, # Vertical (below the map)
xref = "paper", # x-axis
yref = "paper", # y-axis
text = "Note: White regions indicate missing data.", # Annotation
showarrow = FALSE, # No arrow
font = list(size = 12, color = "black"), # Font
align = "left"
),
# Source annotation
list(
x = 0.5, # Horizontal
y = -0.15, # Vertical (below the map)
xref = "paper", # x-axis
yref = "paper", # y-axis
text = "Source: Multiple sources compiled by World Bank (2024) – processed by Our World In Data",
showarrow = FALSE, # No arrow
font = list(size = 10, color = "black"), # Font
align = "center"
)
)
)
# Show plot
final_plot
This final visualisation allows us to see the evolution of the global smoking prevalence from 2000 to 2020. We can clearly see that overall, the smoking prevalence is going down in the majority of countries.However, we can now compare the rates in different countries overtime. In some countries, the prevalence has drastically gone down over the years, in India the prevalence in 2000 was 54.4% compared to 27.2% in 2020. Other countries like France have not seen the prevalence go down as much (France had a prevalence of 34.2% in 2000 compared to 33.4% in 2020).
In the future, I would like to combine this data with data regarding the implemented measures to reduce smoking prevalence in the different countries, which could allow us to see the impact these measures have and potentially explain the different rates the prevalence changes across the world.
Data : Multiple sources compiled by World Bank (2024) – processed by Our World in Data. “Prevalence of current tobacco use (% of adults)” dataset. World Health Organization (via World Bank), “World Development Indicators” [original data]. Retrieved December 9, 2024 from https://ourworldindata.org/grapher/share-of-adults-who-smoke